A Better Good-Turing Estimator 
for Sequence Probabilities 



Aaron B. Wagner 

School of ECE 
Cornell University 

wagner@ece.cornell.edu



Pramod Viswanath 

ECE Department 
University of Illinois 
at Urbana-Champaign 

pramodv@uiuc.edu



Sanjeev R. Kulkarni 

EE Department 
Princeton University 

kulkarni@princeton.edu



Abstract — We consider the problem of estimating the prob- 
ability of an observed string drawn i.i.d. from an unknown 
distribution. The key feature of our study is that the length of 
the observed string is assumed to be of the same order as the 
size of the underlying alphabet. In this setting, many letters are 
unseen and the empirical distribution tends to overestimate the 
probability of the observed letters. To overcome this problem, 
the traditional approach to probability estimation is to use 
the classical Good-Turing estimator. We introduce a natural 
scaling model and use it to show that the Good-Turing sequence 
probability estimator is not consistent. We then introduce a novel 
sequence probability estimator that is indeed consistent under the 
natural scaling model. 

I. Introduction 

Suppose we are given a string drawn i.i.d. from an un- 
known distribution. Our goal is to estimate the probability 
of the observed string. One approach to this problem is to 
use the type, or empirical distribution, of the string as an
approximation of the true underlying distribution and then to 
calculate the resulting probability of the observed string. It 
is well known that this estimator assigns to each string its 
largest possible probability under an i.i.d. distribution. For 
large enough observation sizes, this estimator works well; 
indeed, for large n and a fixed underlying distribution, it is 
a consistent sequence probability estimator. 

Motivated by applications in natural language, we focus 
on a nonstandard regime in which the size of the underlying 
alphabet is of the same order as the length of the observed 
string. In this regime, the type of the observation is a poor 
representation of the true probability distribution. Indeed, 
many letters with nonzero probability will not be observed 
at all and the type will obviously assign these letters zero 
probability. This would not make for a consistent probability 
estimator. 

Since probability estimation and compression are closely 
related, we can turn to the compression literature for suc- 
cor. The results in this literature are negative, however. For 
instance, Orlitsky and Santhanam [1] show that universal
compression of i.i.d. strings drawn from an alphabet that grows 
linearly with the observation size is impossible. As such, the 
compression literature is unhelpful and even suggests that 
seeking a consistent universal sequence probability estimator 
might be futile. 



Nevertheless, sequence probability estimation is of such im- 
portance in applications that several heuristic approaches have 
been developed. The foremost among them is based on the 
classical Good-Turing probability estimator (see Section IV).
The idea is to use the Good-Turing estimator instead of the 
type to estimate the underlying probability distribution. The 
probability of the sequence can then be calculated accordingly. 
Orlitsky et al. [3] have studied the performance of a similar 
scheme in the context of probability estimation for patterns. No 
theoretical results regarding the performance of this approach 
for sequence probability estimation are available, however. 



To analyze the performance of this scheme, we introduce 
a natural scaling model in which the number of observations, 
n, and the underlying alphabet size grow at the same rate. 
Further, the underlying probabilities vary with n. The only 
restriction we make is that no letter should be either too rare 
or too frequent. That is, the probability that any given symbol 
occurs somewhere in the string should be bounded away from 
0 and 1 as the length of the string tends to infinity. In particular,
this condition requires that the probabilities of the letters be
$\Theta(1/n)$. We call this the rare events regime. This scaling
model is formally described in the next section. 



Our model is similar to the one used by Klaassen and 
Mnatsakanov [4] and Khmaladze and Chitashvili [5] to study 
related problems. We used this model previously [6] to show 
consistency of the Good-Turing estimate of the total probabil- 
ity of letters that occur a given number of times in the observed 
string. In the present paper, we use this model to first show that 
the Good-Turing estimator for sequence probabilities performs 
poorly; in fact, a simple example illustrates that it is not 
consistent. Drawing from this example, we then provide a 
novel sequence probability estimator that improves upon the 
Good-Turing estimator — in fact, we show that it is consistent 
in the context of the natural scaling model. This is done in 
Section V. Finally, we discuss the application of our results to
universal hypothesis testing problems in the rare events regime
in Section VI.



II. The Rare Events Regime 

Let $\Omega_n$ be a sequence of finite alphabets. For each $n$, let
$p_n$ and $q_n$ be probability measures on $\Omega_n$ satisfying

\[
\frac{\check{c}}{n} \le \min(p_n(\omega), q_n(\omega)) \le \max(p_n(\omega), q_n(\omega)) \le \frac{\hat{c}}{n} \tag{1}
\]

for all $\omega$ in $\Omega_n$, where $\check{c}$ and $\hat{c}$ are fixed constants that are
independent of $n$. Observe that this requires the cardinality of
the alphabet to grow linearly in $n$.

We observe two strings of length $n$. The first, denoted by $\mathbf{x}$, is
a sequence of symbols drawn i.i.d. from $\Omega_n$ according to $p_n$.
The second, denoted by $\mathbf{y}$, is a sequence of symbols drawn
i.i.d. from $\Omega_n$ according to $q_n$. We assume that $\mathbf{x}$ and $\mathbf{y}$ are
statistically independent. Note that both the alphabet and the
underlying probability measures are permitted to vary with $n$.
Note also that by assumption (1), each element of $\Omega_n$ has
probability $\Theta(1/n)$ under both measures and thus will appear
$\Theta(1)$ times on average in both strings. In fact, the probability
of a given symbol appearing a fixed number of times in either
string is bounded away from 0 and 1 as $n \to \infty$. In other
words, every letter is rare. The number of distinct symbols in
either string will grow linearly with $n$ as a result.
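
The linear growth of the number of distinct symbols can be seen in a small simulation. The sketch below assumes one hypothetical instance of the regime, a uniform distribution on an alphabet of size $2n$, so that every symbol has probability $1/(2n)$. The alphabet size, the function names, and the seed are illustrative choices and are not part of the model.

```python
import math
import random


def distinct_fraction(n, seed=0):
    """Draw n i.i.d. symbols uniformly from {0, ..., 2n-1} and return
    (number of distinct symbols observed) / n."""
    rng = random.Random(seed)
    sample = [rng.randrange(2 * n) for _ in range(n)]
    return len(set(sample)) / n


def expected_limit():
    """Heuristic limit of distinct_fraction(n) as n grows.

    Each of the 2n symbols is observed with probability close to
    1 - exp(-1/2), so about 2n * (1 - exp(-1/2)) symbols are seen."""
    return 2.0 * (1.0 - math.exp(-0.5))
```

For moderately large `n` the simulated fraction should sit close to `expected_limit()`, which is a constant strictly between 0 and 1; the distinct-symbol count is therefore proportional to `n` in this instance.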

Our focus shall be on the quantities $p_n(\mathbf{x})$ and $p_n(\mathbf{y})$. An
important initial observation to make is that the distributions
of these two random variables are invariant under a relabeling
of the elements of $\Omega_n$. It is therefore convenient to consider
the probabilities assigned by the measures $p_n$ and $q_n$ without
reference to the labeling of the symbols. It is also convenient
to normalize these probabilities so that they are $\Theta(1)$.

Let $P_n$ denote the distribution of

\[
(n p_n(X_n), n q_n(X_n)),
\]

where $X_n$ is drawn according to $p_n$. Likewise, let $Q_n$ denote
the distribution of

\[
(n p_n(Y_n), n q_n(Y_n)),
\]

where $Y_n$ is drawn according to $q_n$.

Note that both $P_n$ and $Q_n$ are probability measures on $C :=
[\check{c}, \hat{c}] \times [\check{c}, \hat{c}]$. It follows from the definitions that $P_n$ and $Q_n$
are absolutely continuous with respect to each other and the
Radon-Nikodym derivative is given by

\[
\frac{dQ_n}{dP_n}(x, y) = \frac{y}{x}. \tag{2}
\]



Note that many quantities of interest involving $p_n$ and $q_n$
can be computed using $P_n$ (or $Q_n$). For example, the entropy
of $p_n$ can be expressed as

\[
H(p_n) = -\int_C \log\frac{x}{n} \, dP_n(x, y)
\]

and the relative entropy between $p_n$ and $q_n$ is given by

\[
D(p_n \| q_n) = \int_C \log\frac{x}{y} \, dP_n(x, y).
\]
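
These two identities can be checked numerically on a small case. The following sketch builds $P_n$ as a finite set of atoms and compares the integrals with the direct sums. The helper names and the test distributions (a two-level $p_n$ on $3n$ symbols and a uniform $q_n$) are assumptions chosen for illustration only.

```python
import math
from collections import defaultdict


def normalized_law(p, q, n):
    """Return P_n as a dict {(x, y): mass}, i.e. the law of
    (n * p(X), n * q(X)) when X is drawn according to p."""
    law = defaultdict(float)
    for pw, qw in zip(p, q):
        law[(n * pw, n * qw)] += pw
    return dict(law)


def entropy_from_law(law, n):
    """H(p_n) computed as -integral of log(x / n) dP_n(x, y)."""
    return -sum(mass * math.log(x / n) for (x, _), mass in law.items())


def relative_entropy_from_law(law):
    """D(p_n || q_n) computed as integral of log(x / y) dP_n(x, y)."""
    return sum(mass * math.log(x / y) for (x, y), mass in law.items())
```

The point of the representation is that only the atoms of $P_n$ matter: the labels of the symbols never enter either computation.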



We shall assume that $P_n$ converges in distribution to a
probability measure $P$ on $C$. Since $P_n$ and $Q_n$ are related
by (2), this implies that $Q_n$ converges to a distribution $Q$
satisfying

\[
\frac{dQ}{dP}(x, y) = \frac{y}{x}.
\]

III. Problem Formulation 

Recall that the classical (finite-alphabet, fixed-distribution)
asymptotic equipartition property (AEP) asserts that

\[
\lim_{n \to \infty} \frac{1}{n} \log \mu(\mathbf{w}) = -H(\mu) \tag{3}
\]

where $\mathbf{w}$ is an i.i.d. sequence drawn according to $\mu$ and $H(\cdot)$
denotes discrete entropy. Loosely speaking, (3) says that the
probability of the observed sequence, $\mu(\mathbf{w})$, is approximately

\[
\exp(-nH(\mu)).
\]

In the rare events regime, on the other hand, one expects the
probability of an observed sequence to be approximately

\[
\left( \frac{h}{n} \right)^n
\]

for some constant $h$. Indeed, in the rare events regime the
following AEP holds true (all proofs are contained in Section VII).

Theorem 1:

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) = \int_C \log(x) \, dP(x, y) \quad \text{a.s.}
\]
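
A short simulation illustrates this form of the AEP. The sketch below assumes a hypothetical two-level distribution: $2n$ symbols of probability $1/(4n)$ and $n$ symbols of probability $1/(2n)$, for which $P$ places mass 1/2 on each of $x = 1/4$ and $x = 1/2$ and the limit is $-(1/2)\log 8$. Because the normalized log-probability depends only on which probability level each draw falls in, the code samples the level directly; the function name and seed are illustrative.

```python
import math
import random


def normalized_log_prob(n, seed=0):
    """Return (1/n) * sum_i log(n * p_n(x_i)) for a string of length n
    drawn i.i.d. from the assumed two-level distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Each level carries total probability 1/2:
        # 2n symbols of mass 1/(4n), or n symbols of mass 1/(2n).
        prob = 1.0 / (4 * n) if rng.random() < 0.5 else 1.0 / (2 * n)
        total += math.log(n * prob)
    return total / n


# Limit predicted by Theorem 1 for this distribution.
LIMIT = 0.5 * math.log(0.25) + 0.5 * math.log(0.5)
```

For large `n` the returned value should be close to `LIMIT`, even though the alphabet and the individual probabilities both change with `n`.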



Our goal is to estimate the limit in Theorem 1 universally,
that is, using only the observed sequence $\mathbf{x}$ without reference
to the probability measures $p_n$. Of course, in the classical
setup, the analogous problem of universally estimating the
limit in (3) is straightforward. The distribution $\mu$ can be
determined from the observed sequence by the law of large
numbers, from which the entropy $H(\mu)$ can be calculated. In
the rare events regime, on the other hand, this approach fails
and the problem is more challenging.

We shall also study the following variation on this problem. 
Consider the related quantity $p_n(\mathbf{y})$. That is, the sequence is
generated i.i.d. according to $q_n$, but we evaluate its probability
under $p_n$. This quantity arises in detection problems, where
one must determine the likelihood of a given realization under 
multiple probability distributions. As in the single-sequence 
setup, it turns out that this probability converges if it is suitably 
normalized. 

Theorem 2:

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(y_i)) = \int_C \log(x) \, dQ(x, y) \quad \text{a.s.}
\]



Our goal is then to estimate the limit in Theorem 2 using
only the observed sequences x and y. Again, in a fixed- 
distribution setup, this problem is straightforward because the 
two distributions can be determined exactly from the observed 
sequences in the limit as n tends to infinity. In the rare events 
regime, however, the problem is more challenging. 



IV. The Good-Turing Estimator 

The Good-Turing estimator can be viewed as an estimator
for the probabilities of the individual symbols. Let $A_k$ be the
set of symbols that appear $k$ times in the sequence $\mathbf{x}$, and
let $\varphi_k = |A_k|$ denote the number of such symbols. The basic
form of the Good-Turing estimator assigns probability

\[
\frac{(k+1)\varphi_{k+1}}{n \varphi_k} \tag{4}
\]

to each symbol that appears $k \le n - 1$ times [2]. The case
$k = n$ must be handled separately, but this case is unimportant
to us because in the rare events regime the chance that only
one symbol appears in $\mathbf{x}$ is asymptotically negligible.
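
As a concrete illustration, the per-symbol Good-Turing probability $(k+1)\varphi_{k+1}/(n\varphi_k)$ can be computed from the counts of counts. The sketch below is a minimal version under the assumption that only symbols observed at least once are scored, since $\varphi_0$ cannot be read off the string without knowing the alphabet. The function names are illustrative.

```python
from collections import Counter


def frequency_of_frequencies(x):
    """Return phi, where phi[k] is the number of distinct symbols
    that appear exactly k times in the sequence x."""
    return Counter(Counter(x).values())


def good_turing_symbol_prob(x, k):
    """Good-Turing probability (k+1) * phi_{k+1} / (n * phi_k) assigned
    to each symbol that appears k times in x.

    Raises ValueError if no symbol appears exactly k times."""
    n = len(x)
    phi = frequency_of_frequencies(x)
    if phi[k] == 0:
        raise ValueError("no symbol appears exactly k times")
    return (k + 1) * phi[k + 1] / (n * phi[k])
```

For the string `"aabbcdde"`, two symbols appear once and three appear twice, so each singleton receives probability $2 \cdot 3 / (8 \cdot 2) = 0.375$, while each symbol seen twice receives probability zero because no symbol is seen three times.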

The Good-Turing formula can also be viewed as an estimator
for the total probability of all symbols that appear $k$ times
in $\mathbf{x}$, i.e., $p_n(A_k)$. In particular, the $\varphi_k$ in the denominator can
be viewed as simply dividing the total probability

\[
\frac{(k+1)\varphi_{k+1}}{n}
\]

equally among the $\varphi_k$ symbols that appear $k$ times. In previous
work, we showed that the Good-Turing total probability
estimator is strongly consistent in that for any $k \ge 0$,

\[
\lim_{n \to \infty} \frac{(k+1)\varphi_{k+1}}{n} = \lim_{n \to \infty} p_n(A_k) = \int_C \frac{x^k e^{-x}}{k!} \, dP(x, y) =: \lambda_k \quad \text{a.s.} \tag{5}
\]

(see [6], where the notation is slightly different, for a proof
of a stronger version of this statement). The Good-Turing
probability estimator in (4) gives rise to a natural estimator
for the probability of the observed sequence $\mathbf{x}$:

\[
\prod_{k=1}^{n-1} \left( \frac{(k+1)\varphi_{k+1}}{n \varphi_k} \right)^{k \varphi_k}.
\]

This in turn suggests the following estimator for the limit in
Theorem 1:

\[
\sum_{k=1}^{n-1} \frac{k \varphi_k}{n} \log \frac{(k+1)\varphi_{k+1}}{\varphi_k}. \tag{6}
\]

This estimator is problematic, however, because for the largest
$k$ for which $\varphi_k > 0$,

\[
\frac{(k+1)\varphi_{k+1}}{\varphi_k} = 0,
\]

which means that the $k$th term in (6) equals $-\infty$. Various
"smoothing" techniques have been introduced to address related
problems with the estimator [2]. Our approach will be
to truncate the summation at a large but fixed threshold, $K$:

\[
\sum_{k=1}^{K} \frac{k \varphi_k}{n} \log \frac{(k+1)\varphi_{k+1}}{\varphi_k}.
\]

In the rare events regime, with probability one it will eventually
happen that $\varphi_k > 0$ for all $k = 1, \ldots, K$, thus obviating
the problem. By the result in (5), this estimator will converge
to

\[
\sum_{k=1}^{K} \lambda_{k-1} \log \frac{k \lambda_k}{\lambda_{k-1}}. \tag{7}
\]
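
The truncated Good-Turing estimate, the sum over $k = 1, \ldots, K$ of $(k\varphi_k/n)\log((k+1)\varphi_{k+1}/\varphi_k)$, is straightforward to compute from a string. In the sketch below, two conventions are assumptions made for illustration: a term with $\varphi_k = 0$ is skipped because it carries zero weight, and the function returns $-\infty$ as soon as some $\varphi_k > 0$ is followed by $\varphi_{k+1} = 0$.

```python
import math
from collections import Counter


def truncated_good_turing_estimate(x, K):
    """Return sum_{k=1}^{K} (k * phi_k / n) * log((k+1) * phi_{k+1} / phi_k),
    where phi_k is the number of symbols appearing k times in x."""
    n = len(x)
    phi = Counter(Counter(x).values())
    total = 0.0
    for k in range(1, K + 1):
        if phi[k] == 0:
            continue  # no symbol appears k times: the term has zero weight
        if phi[k + 1] == 0:
            return float("-inf")  # log(0) multiplied by a positive weight
        total += (k * phi[k] / n) * math.log((k + 1) * phi[k + 1] / phi[k])
    return total
```

On a short string such as `"aabbcdde"` the estimate is finite for `K = 1` and already $-\infty$ for `K = 2`, which is the degeneracy that motivates keeping $K$ fixed while $n$ grows.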



We next show that this quantity need not tend to the limit in
Theorem 1 as $K$ tends to infinity.

Let $\Omega_n$ be the set $\{1, 2, \ldots, 3n\}$. Suppose that $p_n$ assigns
probability $1/(4n)$ to the first $2n$ elements and probability
$1/(2n)$ to the remaining $n$. The distribution $q_n$ is obviously
not relevant here so we shall simply set it equal to $p_n$.

The resulting distribution $P$ will place mass 1/2 on each
of the points $(1/4, 1/4)$ and $(1/2, 1/2)$. From Theorem 1, the
limiting normalized probability of $\mathbf{x}$ is $-(1/2)\log 8$. By (7),
the Good-Turing estimate converges to

\[
\frac{1}{2} \sum_{k=1}^{K} \frac{e^{-1/4} (1/4)^{k-1}}{(k-1)!} \left( 1 + e^{-1/4} 2^{k-1} \right) \log \frac{\sqrt{8} \left( 1 + e^{-1/4} 2^{k} \right)}{4 \left( 1 + e^{-1/4} 2^{k-1} \right)}
+ \frac{1}{2} \sum_{k=1}^{K} \frac{e^{-1/4} (1/4)^{k-1}}{(k-1)!} \left( 1 + e^{-1/4} 2^{k-1} \right) \log \frac{1}{\sqrt{8}}.
\]

Now as $K$ tends to infinity, the second sum converges to the
correct answer, $-(1/2)\log 8$. But one can verify that every
term in the first sum is strictly positive. Thus the Good-Turing 
estimator is not consistent in this example. 
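
The size of the gap in this example can be evaluated numerically. The sketch below computes $\lambda_k$ for the limiting distribution that places mass 1/2 on each of $x = 1/4$ and $x = 1/2$, then evaluates the limit $\sum_{k=1}^{K} \lambda_{k-1}\log(k\lambda_k/\lambda_{k-1})$ of the truncated Good-Turing estimate. The function names and the choice of $K$ in the usage note are illustrative.

```python
import math


def lam(k):
    """lambda_k = integral of x^k e^{-x} / k! dP for the example,
    where P puts mass 1/2 on x = 1/4 and mass 1/2 on x = 1/2."""
    return 0.5 * sum(math.exp(-x) * x**k / math.factorial(k) for x in (0.25, 0.5))


def good_turing_limit(K):
    """Limit of the truncated Good-Turing estimate:
    sum_{k=1}^{K} lambda_{k-1} * log(k * lambda_k / lambda_{k-1})."""
    return sum(
        lam(k - 1) * math.log(k * lam(k) / lam(k - 1)) for k in range(1, K + 1)
    )


# Limit of the normalized sequence probability given by Theorem 1.
TRUE_LIMIT = -0.5 * math.log(8.0)
```

Evaluating `good_turing_limit(30) - TRUE_LIMIT` gives a strictly positive number that does not shrink as `K` increases further, in agreement with the argument above.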

The problem is that the Good-Turing estimator is estimating 
the sum, or equivalently the arithmetic mean, of the probabil- 
ities of the symbols appearing k times in x. Estimating the 
sequence probability, on the other hand, amounts to estimating 
the geometric mean of these probabilities. If $p_n$ assigns the
same probability to every symbol, then the arithmetic and geometric
means coincide, and one can show that the Good-Turing
sequence probability estimator is asymptotically correct. In the
above example, however, $p_n$ is not uniform, and the Good-Turing
formula converges to the wrong value. In the next
section, we describe an estimator that targets the geometric 
mean of the probabilities instead of the arithmetic mean, and 
thereby correctly estimates the sequence probability. 



V. A Better Good-Turing Estimator 



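Write

\[
\bar{c} = \frac{\check{c} + \hat{c}}{2}
\]

and then let

\[
\gamma_k^M = -\sum_{m=1}^{M} \sum_{\ell=0}^{m} \left( -\frac{1}{\bar{c}} \right)^{\ell} \binom{m}{\ell} \frac{(k+\ell)!}{m \cdot k!} \cdot \frac{(k+\ell+1)\varphi_{k+\ell+1}}{n}
+ \frac{(k+1)\varphi_{k+1}}{n} \log(\bar{c}).
\]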

Note that $\gamma_k^M$ is only a function of $\mathbf{x}$ and in particular, it
does not depend on $p_n$. The next theorem shows that for large
$K$ and $M$,

\[
\sum_{k=0}^{K} \gamma_k^M
\]

is a consistent estimator for the limit in Theorem 1.
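Theorem 3: For any $\epsilon > 0$,

\[
\limsup_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) - \sum_{k=0}^{K} \gamma_k^M \right| \le \epsilon \quad \text{a.s.} \tag{8}
\]

provided

\[
\max\left( \frac{\exp(\hat{c}) \, \bar{c}}{\check{c}} \left( \frac{\hat{c} - \check{c}}{\hat{c} + \check{c}} \right)^{M+1}, \; \frac{\hat{c}^{K+1} \, \tilde{c}}{(K+1)!} \right) \le \frac{\epsilon}{2},
\]

where

\[
\tilde{c} = \max(|\log \check{c}|, |\log \hat{c}|).
\]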
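The idea behind Theorem 3 is this. Recall from (5) that

\[
\lim_{n \to \infty} \frac{(k+1)\varphi_{k+1}}{n} = \int_C \frac{x^k e^{-x}}{k!} \, dP(x, y) \quad \text{a.s.}
\]

If one could find a sequence of constants $a_k$ such that

\[
\sum_{k=0}^{\infty} a_k \frac{x^k e^{-x}}{k!} = \log(x)
\]

on $[\check{c}, \hat{c}]$, then one might expect that

\[
\lim_{n \to \infty} \sum_{k=0}^{\infty} a_k \frac{(k+1)\varphi_{k+1}}{n} = \int_C \log(x) \, dP(x, y) \quad \text{a.s.}
\]

This is indeed the approach we took to find the formula for $\gamma_k^M$.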
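
A minimal sketch of the estimator follows. It writes $\gamma_k^M$ as a linear combination of the Good-Turing total-probability estimates $g(j) = (j+1)\varphi_{j+1}/n$, with coefficients taken from the truncated expansion of $\log(x)$ around $\bar{c} = (\check{c} + \hat{c})/2$. The function names are illustrative, and the constants $\check{c}$ and $\hat{c}$ (`c_lo`, `c_hi`) are assumed to be supplied by the user.

```python
import math
from collections import Counter


def good_turing_totals(x):
    """Return g, where g(j) = (j + 1) * phi_{j+1} / n is the Good-Turing
    estimate of the total probability of symbols appearing j times in x."""
    n = len(x)
    phi = Counter(Counter(x).values())
    return lambda j: (j + 1) * phi[j + 1] / n


def gamma_k(g, k, M, c_bar):
    """gamma_k^M = log(c_bar) * g(k)
    - sum_{m=1}^{M} sum_{l=0}^{m} (-1/c_bar)^l C(m, l) (k+l)! / (m k!) g(k+l)."""
    total = math.log(c_bar) * g(k)
    for m in range(1, M + 1):
        for ell in range(m + 1):
            coeff = (-1.0 / c_bar) ** ell * math.comb(m, ell)
            coeff *= math.factorial(k + ell) / (m * math.factorial(k))
            total -= coeff * g(k + ell)
    return total


def sequence_estimate(g, K, M, c_lo, c_hi):
    """Return sum_{k=0}^{K} gamma_k^M with c_bar = (c_lo + c_hi) / 2."""
    c_bar = 0.5 * (c_lo + c_hi)
    return sum(gamma_k(g, k, M, c_bar) for k in range(K + 1))
```

Passing `good_turing_totals(x)` as `g` gives the estimate for an observed string `x`. Passing the limiting values $\lambda_j$ instead isolates the approximation error due to finite $K$ and $M$; for a limiting distribution with mass 1/2 on each of $x = 1/4$ and $x = 1/2$, moderate values of $K$ and $M$ already bring the sum very close to $-(1/2)\log 8$. The coefficients grow quickly with $M$ and $k$, so for finite $n$ the choice of $K$ and $M$ trades approximation error against sensitivity to noise in the counts.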

The estimator can be naturally extended to the two-sequence
setup, namely to the problem of universally estimating $p_n(\mathbf{y})$.

Let $\varphi_{k,\ell}$ be the number of symbols in $\Omega_n$ that appear $k$
times in $\mathbf{x}$ and $\ell$ times in $\mathbf{y}$. Then let



M 



7f 



EE(- 

m=l 1=0 



^ J TO - fc! ^ 



log(c)E 



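Note that $\tilde{\gamma}_k^M$ is a function of $\mathbf{x}$ and $\mathbf{y}$.

Theorem 4: For any $\epsilon > 0$,

\[
\limsup_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(y_i)) - \sum_{k=0}^{K} \tilde{\gamma}_k^M \right| \le \epsilon \quad \text{a.s.}
\]

provided

\[
\max\left( \frac{\exp(\hat{c}) \, \bar{c}}{\check{c}} \left( \frac{\hat{c} - \check{c}}{\hat{c} + \check{c}} \right)^{M+1}, \; \frac{\hat{c}^{K+1} \, \tilde{c}}{(K+1)!} \right) \le \frac{\epsilon}{2},
\]

with $\tilde{c} = \max(|\log \check{c}|, |\log \hat{c}|)$ as in Theorem 3.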

This result shows that although we are unable to determine
$p_n$ from $\mathbf{x}$, we are able to glean enough information about $p_n$
to determine the limit in Theorem 2.



VI. Universal Hypothesis Testing 

The $\tilde{\gamma}_k^M$ estimator leads to a natural scheme for the problem
of universal hypothesis testing. Suppose that we again observe
the sequences $\mathbf{x}$ and $\mathbf{y}$, which we now view as training data. In
addition, we observe a test sequence, say $\mathbf{z}$, which is generated
i.i.d. from the distribution $r_n$. We assume that either $r_n = p_n$
for all $n$ or $r_n = q_n$ for all $n$. The problem is to determine
which of these two possibilities is in effect using only the
sequences $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$.

Using Theorem 4, one can estimate $p_n(\mathbf{z})$ and $q_n(\mathbf{z})$, and
by comparing the two, determine which of the two distributions
generated $\mathbf{z}$. This will make for a consistent universal
classifier, without recourse to actually estimating the true
underlying distributions $p_n$ and $q_n$. As a scheme for universal
hypothesis testing, however, this approach is quite complicated 
and there is no reason to believe it would be optimal in 
an error-exponent sense. We are currently investigating other, 
more direct approaches to the universal hypothesis testing 
problem in the rare events regime. For a discussion of universal 
hypothesis testing in the traditional, fixed-distribution regime, 
see Gutman [7] and Ziv [8]. 

VII. Proofs 

Due to space limitations, we will only prove Theorem 1
and sketch the proof of Theorem 3. The proofs of Theorems 2
and 4 are similar.

Lemma 1:

\[
\lim_{n \to \infty} E\left[ \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) \right] = \int_C \log(x) \, dP(x, y).
\]

Proof: Note that for any $i$,

\[
E[\log(n p_n(x_i))] = \sum_{\omega \in \Omega_n} p_n(\omega) \log(n p_n(\omega)) = \int_C \log(x) \, dP_n(x, y).
\]

Since $\log(x)$ is bounded and continuous over $C$ and $P_n$
converges in distribution to $P$, the result follows. ■
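Lemma 2:

\[
\lim_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) - E\left[ \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) \right] \right| = 0 \quad \text{a.s.}
\]

Proof: Consider the sum

\[
\sum_{i=1}^{n} \log(n p_n(x_i)).
\]

If one symbol in the sequence $\mathbf{x}$ is altered, then this sum can
change by at most

\[
\log \frac{\hat{c}}{\check{c}}.
\]

It follows from the Azuma-Hoeffding-Bennett concentration
inequality [9, Corollary 2.4.14] that, for any $\delta > 0$, the probability
that the normalized sum deviates from its mean by at least $\delta$ decays
exponentially in $n$; for instance,

\[
\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) - E\left[ \frac{1}{n} \sum_{i=1}^{n} \log(n p_n(x_i)) \right] \right| \ge \delta \right) \le 2 \exp\left( -\frac{n \delta^2}{2 \log^2(\hat{c}/\check{c})} \right).
\]

The result then follows by the Borel-Cantelli lemma. ■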
Note that Theorem 1 follows immediately from Lemmas 1
and 2.

The key step in the proof of Theorem 3 is showing that $\gamma_k^M$
converges to the proper limit. This is shown in the next and
final lemma.

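Lemma 3: For any $\epsilon > 0$ and any $k \ge 0$,

\[
\lim_{n \to \infty} \left| \gamma_k^M - \int_C \log(x) \frac{\exp(-x) x^k}{k!} \, dP(x, y) \right| \le \epsilon
\]

provided

\[
\frac{\bar{c}}{\check{c}} \left( \frac{\hat{c} - \check{c}}{\hat{c} + \check{c}} \right)^{M+1} < \epsilon.
\]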

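Proof (sketch): Note that the limit exists by (5). By the
triangle inequality,

\[
\left| \gamma_k^M - \int_C \log(x) \frac{\exp(-x) x^k}{k!} \, dP(x, y) \right|
\le \left| \gamma_k^M - \bar{\gamma}_k^M \right| + \left| \bar{\gamma}_k^M - \int_C \log(x) \frac{\exp(-x) x^k}{k!} \, dP(x, y) \right|, \tag{9}
\]

where

\[
\bar{\gamma}_k^M = -\sum_{m=1}^{M} \sum_{\ell=0}^{m} \left( -\frac{1}{\bar{c}} \right)^{\ell} \binom{m}{\ell} \frac{(k+\ell)!}{m \cdot k!} \lambda_{k+\ell} + \log(\bar{c}) \lambda_k. \tag{10}
\]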

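The first term on the right-hand side of (9) tends to zero by (5).
Now

\[
\sum_{m=1}^{M} \sum_{\ell=0}^{m} \left( -\frac{1}{\bar{c}} \right)^{\ell} \binom{m}{\ell} \frac{(k+\ell)!}{m \cdot k!} \lambda_{k+\ell}
= \int_C \sum_{m=1}^{M} \frac{1}{m} \sum_{\ell=0}^{m} \binom{m}{\ell} \left( -\frac{x}{\bar{c}} \right)^{\ell} \frac{\exp(-x) x^k}{k!} \, dP(x, y).
\]

By the Binomial Theorem,

\[
\sum_{\ell=0}^{m} \binom{m}{\ell} \left( -\frac{x}{\bar{c}} \right)^{\ell} = \left( 1 - \frac{x}{\bar{c}} \right)^{m}.
\]

Substituting these last two equations into (10) yields

\[
\bar{\gamma}_k^M = -\int_C \sum_{m=1}^{M} \frac{1}{m} \left( 1 - \frac{x}{\bar{c}} \right)^{m} \frac{\exp(-x) x^k}{k!} \, dP(x, y) + \log(\bar{c}) \lambda_k.
\]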



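Using the well-known power series

\[
\log(1 + x) = \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m} x^m,
\]

valid for $-1 < x < 1$, one can show that

\[
\sup_{\check{c} \le x \le \hat{c}} \left| \log\frac{x}{\bar{c}} + \sum_{m=1}^{M} \frac{1}{m} \left( 1 - \frac{x}{\bar{c}} \right)^{m} \right|
\le \frac{\bar{c}}{\check{c}} \left( \frac{\hat{c} - \check{c}}{\hat{c} + \check{c}} \right)^{M+1} < \epsilon
\]

by hypothesis. Thus

\[
\left| \bar{\gamma}_k^M - \int_C \log(x) \frac{\exp(-x) x^k}{k!} \, dP(x, y) \right|
= \left| \int_C \left[ \log\frac{x}{\bar{c}} + \sum_{m=1}^{M} \frac{1}{m} \left( 1 - \frac{x}{\bar{c}} \right)^{m} \right] \frac{\exp(-x) x^k}{k!} \, dP(x, y) \right|
\le \epsilon \int_C \frac{\exp(-x) x^k}{k!} \, dP(x, y) \le \epsilon.
\]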
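
The truncation bound used in this step can be checked numerically on a grid. The sketch below compares the error of the degree-$M$ expansion of $\log(x/\bar{c})$ with the bound $(\bar{c}/\check{c})((\hat{c}-\check{c})/(\hat{c}+\check{c}))^{M+1}$. The function names, the grid, and the constants in the usage note are illustrative assumptions.

```python
import math


def truncation_error(x, M, c_bar):
    """Return |log(x / c_bar) + sum_{m=1}^{M} (1/m) * (1 - x / c_bar)^m|."""
    u = 1.0 - x / c_bar
    return abs(math.log(x / c_bar) + sum(u**m / m for m in range(1, M + 1)))


def truncation_bound(M, c_lo, c_hi):
    """Return (c_bar / c_lo) * ((c_hi - c_lo) / (c_hi + c_lo))^(M + 1),
    where c_bar = (c_lo + c_hi) / 2."""
    c_bar = 0.5 * (c_lo + c_hi)
    return (c_bar / c_lo) * ((c_hi - c_lo) / (c_hi + c_lo)) ** (M + 1)


def max_error_on_grid(M, c_lo, c_hi, points=101):
    """Largest truncation error over an evenly spaced grid on [c_lo, c_hi]."""
    c_bar = 0.5 * (c_lo + c_hi)
    step = (c_hi - c_lo) / (points - 1)
    return max(truncation_error(c_lo + i * step, M, c_bar) for i in range(points))
```

With `c_lo = 0.25` and `c_hi = 0.5`, the ratio $(\hat{c}-\check{c})/(\hat{c}+\check{c})$ equals 1/3, so the bound shrinks geometrically in `M` and the grid error stays below it.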



Since

\[
\sum_{k=0}^{\infty} \frac{\exp(-x) x^k}{k!} = 1,
\]

one would expect from Lemma 3 that for large $K$ and $M$,

\[
\sum_{k=0}^{K} \gamma_k^M
\]

would be close to

\[
\int_C \log(x) \, dP(x, y).
\]

Indeed, one can prove Theorem 3 using this approach. The
details are omitted.